Introduction

Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus. Since its first discovery in Wuhan, China in November 2019, it has spread to numerous countries across all continents, and the WHO declared it a pandemic on 11 March 2020.

Various countries have introduced measures and restrictions in response to COVID-19. Since the “circuit breaker”, a partial nationwide lockdown during which only essential services were allowed to stay open, Singapore residents have felt a great impact on daily life: they were encouraged to stay home as much as possible, and mask-wearing became mandatory when going out. The Singapore government has been constantly revising its policies and social restrictions. Three phases of planned reopening were announced on 19 May, namely “Safe Reopening” (Phase 1), “Safer Transition” (Phase 2), and finally “Safe Nation” (Phase 3).

Problem Statement

Microblogging has become one of the most useful tools for sharing everyday life events and news and for expressing opinions about those events. As Twitter posts are short and constantly being generated, they are a great source for gauging public sentiment towards events that occurred throughout the COVID-19 period in Singapore.

In our Capstone Project, we perform exploratory data analysis of the Singapore COVID-19 situation, together with sentiment analysis and modeling of COVID-19 tweets, to answer the following research questions:

  1. What are the prevalent sentiments and emotions expressed in Singapore tweets about the current COVID-19 situation?

  2. Did sentiment change over time amidst global reopening and rising vaccination rates, in contrast with growing daily local cases and deaths?

For our data science project, we loaded the following packages, using the tidyverse approach.

# Load necessary packages
pacman::p_load(tidyverse, broom, modelr, lubridate, 
               tidytext, wordcloud2, wordcloud, reshape2,
               textdata,   # sentiment lexicons (AFINN, Bing, NRC)
               jtools, huxtable, gridExtra,
               psych, Hmisc, car, sandwich, 
               ggthemes, scales, ggstance, 
               gvlma, gtrendsR, rtweet, glue)

my_colors <- c("#05A4C0", "#85CEDA", "#D2A7D8", "#A67BC5", "#BB1C8B", "#8D266E", "gold4", "darkred", "deepskyblue4")

my_theme <- theme(plot.background = element_rect(fill = "grey98", color = "grey20"),
                  panel.background = element_rect(fill = "grey98"),
                  panel.grid.major = element_line(colour = "grey87"),
                  text = element_text(color = "grey20"),
                  plot.title = element_text(size = 22),
                  plot.subtitle = element_text(size = 17),
                  axis.title = element_text(size = 15),
                  axis.text = element_text(size = 15),
                  legend.box.background = element_rect(color = "grey20", fill = "grey98", size = 0.1),
                  legend.box.margin = margin(t = 3, r = 3, b = 3, l = 3),
                  legend.title = element_blank(),
                  legend.text = element_text(size = 15),
                  strip.text = element_text(size=17))

Import

Then, we imported our dataset.

Data Source 1: SG COVID DATA

SSA <- read.csv("Covid-19.csv")

Within the dataset, each row is a date carrying daily Singapore COVID-19 statistics: confirmed cases, discharges, hospitalisations, ICU occupancy, deaths, and cumulative vaccination figures.

Our analysis focuses on four continuous measures: Daily.Confirmed, Daily.Deaths, Still.Hospitalised, and Intensive.Care.Unit..ICU. (ICU occupancy). Daily.Deaths will serve as the dependent variable (continuous y) in our later regression models, with the other three measures as predictors.

glimpse(SSA)
## Rows: 999
## Columns: 37
## $ Date                                                          <chr> "2020-01…
## $ Daily.Confirmed                                               <int> 1, 2, 1,…
## $ False.Positives.Found                                         <int> NA, NA, …
## $ Cumulative.Confirmed                                          <int> 1, 3, 4,…
## $ Daily.Discharged                                              <int> 0, 0, 0,…
## $ Passed.but.not.due.to.COVID                                   <int> 0, 0, 0,…
## $ Cumulative.Discharged                                         <int> 0, 0, 0,…
## $ Discharged.to.Isolation                                       <int> 0, 0, 0,…
## $ Still.Hospitalised                                            <int> 1, 3, 4,…
## $ Daily.Deaths                                                  <int> 0, 0, 0,…
## $ Cumulative.Deaths                                             <int> 0, 0, 0,…
## $ Tested.positive.demise                                        <int> 0, 0, 0,…
## $ Daily.Imported                                                <int> 1, 2, 1,…
## $ Daily.Local.transmission                                      <int> 0, 0, 0,…
## $ Local.cases.residing.in.dorms.MOH.report                      <int> NA, NA, …
## $ Local.cases.not.residing.in.doms.MOH.report                   <int> NA, NA, …
## $ Intensive.Care.Unit..ICU.                                     <int> 0, 0, 0,…
## $ General.Wards.MOH.report                                      <int> NA, NA, …
## $ In.Isolation.MOH.report                                       <int> NA, NA, …
## $ Total.Completed.Isolation.MOH.report                          <int> NA, NA, …
## $ Total.Hospital.Discharged.MOH.report                          <int> NA, NA, …
## $ Requires.Oxygen.Supplementation.or.Unstable                   <int> NA, NA, …
## $ Linked.community.cases                                        <int> NA, NA, …
## $ Unlinked.community.cases                                      <int> NA, NA, …
## $ Phase                                                         <chr> "", "", …
## $ Cumulative.Vaccine.Doses                                      <int> NA, NA, …
## $ Cumulative.Individuals.Vaccinated                             <int> NA, NA, …
## $ Cumulative.Individuals.Vaccination.Completed                  <int> NA, NA, …
## $ Perc.population.completed.at.least.one.dose                   <chr> "", "", …
## $ Perc.population.completed.vaccination                         <chr> "", "", …
## $ Sinovac.vaccine.doses                                         <int> NA, NA, …
## $ Cumulative.individuals.using.Sinovac.vaccine                  <int> NA, NA, …
## $ Doses.of.other.vaccines.recognised.by.WHO                     <int> NA, NA, …
## $ Cumulative.individuals.using.other.vaccines.recognised.by.WHO <int> NA, NA, …
## $ Number.taken.booster.shots                                    <int> NA, NA, …
## $ Perc.population.taken.booster.shots                           <chr> "", "", …
## $ X                                                             <lgl> NA, NA, …

Data Source 1: Tidy & Transform

The first thing we did with our loaded dataset was to convert it to a tibble, then drop the rows and columns outside our analysis window, keeping the date and the four key daily measures. We also converted the Date column from character to Date class.

SSA1<- tibble(SSA)

class(SSA)
## [1] "data.frame"
# View(SSA1)
SSA1 <- SSA1[-(1:626) , -(18:37)]   # drop rows before the analysis window and trailing columns (by position)
SSA1 <- SSA1[ , -(11:16)]           # drop further unused columns
SSA1 <- SSA1[ , -(3:8)]
SSA1 <- SSA1[-(35:373) , ]          # keep the 34-day analysis window

SSA1$Date <- as.Date(SSA1$Date)
SSA1 %>% 
  ggplot(aes(x=Date)) +
  geom_line(aes(y = Daily.Confirmed), color = "darkred") + 
  geom_line(aes(y = Daily.Deaths), color="yellow") +
  geom_line(aes(y = Still.Hospitalised), color="blue")+
  geom_line(aes(y = Intensive.Care.Unit..ICU.), color="green")

glimpse(SSA1)
## Rows: 34
## Columns: 5
## $ Date                      <date> 2021-10-10, 2021-10-11, 2021-10-12, 2021-10…
## $ Daily.Confirmed           <int> 2809, 2263, 2976, 3190, 2932, 3445, 3348, 30…
## $ Still.Hospitalised        <int> 1583, 1668, 1589, 1477, 1481, 1563, 1434, 16…
## $ Daily.Deaths              <int> 9, 10, 11, 9, 15, 8, 9, 9, 6, 7, 18, 16, 14,…
## $ Intensive.Care.Unit..ICU. <int> 41, 42, 42, 46, 46, 48, 62, 66, 67, 71, 67, …
SSA1 %>% 
  select("Daily.Confirmed", "Daily.Deaths","Still.Hospitalised", "Intensive.Care.Unit..ICU." 
  ) %>% 
  summary(.)
##  Daily.Confirmed  Daily.Deaths   Still.Hospitalised Intensive.Care.Unit..ICU.
##  Min.   :1767    Min.   : 6.00   Min.   :1434       Min.   :41.00            
##  1st Qu.:2943    1st Qu.: 9.00   1st Qu.:1584       1st Qu.:58.25            
##  Median :3182    Median :12.00   Median :1640       Median :64.00            
##  Mean   :3206    Mean   :12.03   Mean   :1633       Mean   :61.88            
##  3rd Qu.:3472    3rd Qu.:14.75   3rd Qu.:1686       3rd Qu.:68.50            
##  Max.   :5324    Max.   :18.00   Max.   :1757       Max.   :75.00
# SSA1 %>% 
#   select("Daily.Confirmed", "Daily.Deaths","Still.Hospitalised", "Intensive.Care.Unit..ICU." 
#   ) %>% 
#   alpha(.)

Data Source 2: SG TWITTER DATA

# We observed that a 7-day extraction is usually capped below 3,000 tweets.
# sg_tweets <- search_tweets(q = "#coronavirus OR #covid19 OR #COVID OR #stayhome OR #Covid-19 OR #pandemic OR #virus OR #social-distance OR #self-quarantine OR #swab-test OR #PCR OR #infection-rate", 
#                            n = 3000, 
#                            lang = "en",
#                            include_rts = F,
#                            geocode = lookup_coords("singapore"))
sg_tweets <- read.csv("covid19_sg_tweeter_1010_1115.csv")

Let’s explore our tweet data on COVID-19 since our first extraction on 18th October, to understand sentiment after the recent sharp rise in local cases and deaths since end-September.

We also identified two key events over the period, to analyse whether each event had a significant influence on public sentiment, in answer to our research question.

2021-10-20

  • PM Lee’s address on COVID-19 situation
  • Announcement of the extension of the Stabilisation Phase for four weeks, through to 21 November 2021.
  • Unvaccinated people can no longer eat at hawker centres or enter shopping malls.

2021-11-08

  • Allowing up to five fully vaccinated persons from the same household to dine in together at food and beverage (F&B) establishments.
  • Loosening restrictions on sports and selected MICE (Meetings, Incentives, Conferences and Exhibitions) events.
  • Resuming more activities in schools, in preparation for the larger-scale safe resumption of co-curricular learning activities in the coming school year.
  • Adjusting border measures and extending Vaccinated Travel Lanes (VTL) to Malaysia, Finland and Sweden.
# Basic EDA on tweet volume over time (ALL)
options(repr.plot.width=20, repr.plot.height=9)

sg_tweets %>% 
  select(created_at) %>% 
  mutate(date = as.Date(created_at)) %>% 
  group_by(date) %>% 
  summarise(n = n(), .groups = "drop_last") %>%
  ggplot(aes(x=date, y = n)) + 
  geom_line(size = 1, color = my_colors[1]) +
  coord_cartesian(clip = 'off') +
  geom_vline(xintercept = as.Date('2021-10-20'), linetype="dotted", size = 1.5, color = "red") +
  geom_vline(xintercept = as.Date('2021-11-08'), linetype="dotted", size = 1.5, color = "darkblue") +
  my_theme + theme(axis.title.x = element_blank()) +
  scale_x_date(date_breaks = "1 day") + 
  ggthemes::theme_fivethirtyeight() +
  theme(axis.text.x = element_text(angle = 45, size = rel(0.6), vjust = 1, hjust = 1 )) +
  labs(title = "Number of COVID-19 Tweets shared between 10th Oct - 15th Nov", subtitle = "Number of tweets spiked on key events") +
    geom_label(aes(x=as.Date('2021-10-19'), y = 380, label = "PM Lee's address on COVID-19"), color = "red", size = 4, angle = 90, fontface = "bold") +
    geom_label(aes(x=as.Date('2021-11-07'), y = 380, label = "More Opening on COVID-19 restrictions"  ), color = "darkblue", size = 4, angle = 90, fontface = "bold") 

Data Source 2: Tidy & Transform

# Step 1: Tokenization ----
sg_tweets_id <- sg_tweets %>% 
  mutate(created_at = as.Date(created_at)) %>% 
  rowid_to_column("id")

tidy_tweets_token <- sg_tweets_id %>%
  drop_na(text) %>% 
  select(id, created_at, status_id, text) %>% 
  filter(text != "") %>% 
  unnest_tokens(word, text, token = "tweets") 

# Step 2: Cleaning ----
tweets_cleaned <- tidy_tweets_token %>%
  anti_join(tidytext::stop_words)

# Manual cleaning, filtering out unwanted words
tweets_token_cleaned <- tweets_cleaned %>%
  filter(!word %in% c("singapore", "covid", "covid19", "positive", "negative",
                      "oct", "nov", "october", "november", "news", "amp",
                      "reuters", "daily", "malaysia", "press", "journal",
                      "amid", "weekly", "days", "weeks", "china", "chinese",
                      "report", "reports", "am", "pm", "dont", "taking",
                      "found", "morning", "bloomberg", "months", "month",
                      "india", "media", "week", "read", "data", "europe",
                      "monday", "tuesday", "wednesday", "thursday", "friday",
                      "saturday", "sunday", "wall", "street") &
           !str_detect(word, "^#|^@") &
           !str_detect(word, "^[:digit:]+$"))
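The regex filters above drop hashtags, mentions, and pure-number tokens. A minimal sketch of their behaviour on made-up tokens (base-R `grepl` is used here as an equivalent of the `str_detect` calls):

```r
tokens <- c("#covid19sg", "@moh_singapore", "2021", "vaccine")  # hypothetical examples
keep <- !grepl("^#|^@", tokens) & !grepl("^[[:digit:]]+$", tokens)
stopifnot(identical(tokens[keep], "vaccine"))  # only the plain word survives
```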

Visualisation for Basic Exploratory Data Analysis

A Simple Word Cloud

covid_tweets_for_wc <- tweets_token_cleaned %>% 
  group_by(word) %>% 
  summarise(frequency = n()) %>% 
  arrange(desc(frequency))

covid_tweets_for_wc %>% 
  filter(frequency > 3) %>% 
  wordcloud2(backgroundColor = "black", 
             color = "random-light")

Word Cloud (Positive vs Negative)

# A Positive-Negative Word Cloud using the Bing lexicon
BING <- get_sentiments("bing")

tweets_token_cleaned %>% 
  inner_join(BING, by="word") %>%
  count(word, sentiment, sort=T) %>% 
  acast(word ~ sentiment, value.var = "n", fill=0) %>% 
  comparison.cloud(colors=my_colors[c(5, 1)], max.words = 400, title.size = 2,
                   scale = c(3,.5))

Top 3 Most Negative Tweets in the dataset

AFINN <- get_sentiments("afinn")

## TOP 3 MOST NEGATIVE TWEET ----
tweets_AFINN_indexed <- tweets_token_cleaned %>% 
  inner_join(AFINN, by = "word")

tweet_level_sentiment <- tweets_AFINN_indexed %>% 
  group_by(id) %>% 
  summarise(average_sentiment = mean(value),
            n_of_words_indexed = n()
  )

top3_negative <- tweet_level_sentiment %>% 
  arrange(average_sentiment) %>% 
  head(3) 

sg_tweets_id %>% 
  filter(id %in% top3_negative$id ) %>% 
  select(text) %>% 
  pull()
## [1] "'They don't choose to have it': Why Covid-19 is hell for people with OCD"                                                                                                      
## [2] "Asked my mum to get me donuts 🍩 I’ve no idea why i did that when I can’t taste no shit. Arghh but just cravings 😫 fuck covid 19, fuck the monthly cycle, fuck everything! 😖"
## [3] "deadass my eyes went from covid 19 then positive bitch"

Top 3 Most Positive Tweets in the dataset

# TOP 3 MOST POSITIVE TWEETS ----
top3_positive <- tweet_level_sentiment %>% 
  arrange(desc(average_sentiment)) %>% 
  head(3)

sg_tweets_id %>% 
  filter(id %in% top3_positive$id) %>% 
  select(text) %>% 
  pull()
## [1] "…or else they find alternative employment. They are also bringing in a vax passport to enter some venues buildings etc, unless there is a current Covid-19 test (often valid only for a day or 2). It's basically becoming a no vax, no job, &amp; no fun situation. This I approve."                                                  
## [2] "“MOH said the doctor had no known medical conditions and was partially vaccinated with a non-mRNA Covid-19 vaccine under the Special Access Route.”\n\nWow, this has to be stated so explicitly. I wonder why? \n\n⁦⁩ ⁦@Huigoon⁩"                                                                                                          
## [3] "@DavidBieleski @bergeron_brent @katiehasedits This trend was evident on 9Aug 2020, Back then out of the 3, SG had the most cases with 55,104, whilst NZ, the least with 1,219. Even back then Greece the highest with 213 COVID-19 related deaths.\n\nFast forward to 5 Nov 2021, Greece is the winner in Covid-19 cases &amp; deaths."

Overall Emotion Analysis

Distribution Breakdown by Emotion Class using NRC technique

NRC <- get_sentiments("nrc")

options(repr.plot.width=15, repr.plot.height=9)

tweets_token_cleaned %>% 
  inner_join(NRC, "word") %>%
  filter(!sentiment %in% c("positive", "negative")) %>% 
  count(sentiment, sort=T) %>% 
  ggplot(aes(x=reorder(sentiment, n), y=n)) +
  geom_bar(stat="identity", aes(fill=n), show.legend=F) +
  geom_label(aes(label=format(n, big.mark = ",")), size=5, fill="white") +
  labs(x="Sentiment", y="Frequency", title="What is the overall mood in Tweets?") +
  scale_fill_gradient(low = my_colors[3], high = my_colors[1], guide="none") +
  coord_flip() + 
  my_theme + theme(axis.text.x = element_blank())

Most Frequent Words by Emotion Class

#options(repr.plot.width=25, repr.plot.height=9)

tweets_token_cleaned %>% 
  inner_join(NRC, "word") %>% 
  count(sentiment, word, sort=T) %>%
  group_by(sentiment) %>% 
  arrange(desc(n)) %>% 
  slice(1:7) %>% 
  ggplot(aes(x=reorder(word, n), y=n)) +
  geom_col(aes(fill=sentiment), show.legend = F) +
  facet_wrap(~sentiment, scales = "free_y", nrow = 2, ncol = 5) +
  coord_flip() +
  my_theme + theme(axis.text.x = element_blank()) +
  labs(x="Word", y="Frequency", title="Sentiment split by most frequent words") +
  scale_fill_manual(values = c(my_colors, "#BE82AF", "#9D4387", "#DEC0D7",
                               "#40BDC8", "#80D3DB", "#BFE9ED"))

Research on the Influence of Singapore PM’s Address

20 Oct 2021

Here, we are interested in a research question: did the key COVID-19 events that occurred on 20 Oct, listed below, change public sentiment or increase public trust in the leadership?

  1. PM’s address to the nation on the COVID-19 situation in Singapore:
  • The path to a “New Normal”: diverging from the original zero-COVID approach towards living with COVID-19.
  • Local cases have spiked sharply over the past few weeks.
  • A call for unity and COVID resilience.
  2. Announcement that COVID social curbs would be extended by another month to 21 Nov, having originally been slated to last until 24 Oct:
  • Dining out limited to groups of 2.
  • Work from home remains the default.

We are going to use Regression Discontinuity Analysis to draw causal inference about the effect of this event.

Firstly, we explore the data from 10 days before and after PM Lee’s address, assuming that dates close to the 20 Oct cutoff carry the most relevant effects.
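The specification we use later can be sketched on simulated data (all values below are made up for illustration): regress the outcome on a before/after indicator plus the running variable centered at the cutoff, so the indicator’s coefficient estimates the jump at the cutoff.

```r
# RDD sketch on simulated data (hypothetical numbers, not our tweets)
set.seed(1)
dates  <- seq(as.Date("2021-10-10"), as.Date("2021-10-30"), by = "day")
cutoff <- as.Date("2021-10-20")
treat  <- factor(ifelse(dates < cutoff, "Before", "After"))
# simulate a 0.15 jump in average sentiment at the cutoff
y <- -0.5 + 0.15 * (dates >= cutoff) + rnorm(length(dates), sd = 0.05)
# indicator + running variable centered at the cutoff
fit <- lm(y ~ treat + I(as.numeric(dates - cutoff)))
coef(fit)["treatBefore"]  # estimated drop relative to "After" (the reference level)
```

Because factor levels are sorted alphabetically, “After” is the reference level, so the `treatBefore` coefficient is negative when sentiment is lower before the cutoff, mirroring the `covid_address_effectBefore` coefficient in our model below.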

Data Overview

Using AFINN technique

sentiment_daily <- tweets_AFINN_indexed %>% 
  group_by(created_at) %>% 
  summarise(average_sentiment = mean(value),
            n_of_words_indexed = n()) 

  # Plot
sentiment_daily %>% 
  filter(created_at >= as.Date('2021-10-10') & created_at <= as.Date('2021-10-30')) %>% 
  ggplot(aes(x = created_at, y = average_sentiment) ) +
  geom_point(size = 2, color = my_colors[1]) +
  geom_vline(xintercept = as.Date('2021-10-20'), size = 1, linetype="dotdash", color = my_colors[6]) +
  scale_x_date(date_breaks = "1 day") + 
  ggtitle("Distribution of Average Sentiment \n10 days before & after PM address") +
  ggthemes::theme_fivethirtyeight() +
  theme(axis.text.x = element_text(angle = 45, size = rel(0.6), vjust = 1, hjust = 1 ))

Using NRC technique for Emotion Classification

# Extract for analysis period
 tweets_token_analysis_period <- tweets_token_cleaned %>% 
  filter(created_at >= as.Date('2021-10-10') & created_at <= as.Date('2021-10-30')) 

classified_sentiment <- tweets_token_analysis_period %>% 
  inner_join(NRC, "word") %>% 
  group_by(sentiment, created_at) %>% 
  summarise(cnt = n()) 

# Plot Chart
classified_sentiment %>% 
  filter(!sentiment %in% c("positive", "negative")) %>% 
  ggplot(aes(x=created_at, y =cnt, color = sentiment)) +
  geom_point() +
  facet_wrap(~sentiment, scales = "free_y", nrow = 2, ncol = 4) +
  geom_vline(xintercept = as.Date('2021-10-20'), size = 1,linetype="dotdash", color = my_colors[8]) +
  scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d") +
    theme(axis.text.x = element_text(angle = 45, size = rel(0.8), vjust = 1, hjust = 1 )) +
  labs(y="Count of Emotional Words", x="Period of Date")

Using a radar chart as an alternative visualisation.

# Data transformation
char_sentiment <- classified_sentiment %>% 
  filter(!sentiment %in% c("positive", "negative")) %>%
  mutate (covid_address_effect = as.factor(ifelse(created_at < '2021-10-20','Before','After'))) %>%
  group_by(sentiment, covid_address_effect) %>% 
  summarise(char_sentiment_count = sum(cnt)) 

total_char <-   classified_sentiment %>% 
  filter(!sentiment %in% c("positive", "negative")) %>%
  mutate (covid_address_effect = as.factor(ifelse(created_at < '2021-10-20','Before','After'))) %>%
  group_by(covid_address_effect) %>% 
  summarise(total = sum(cnt))

# Plot Chart
char_sentiment %>% 
  inner_join(total_char, by="covid_address_effect") %>% 
  mutate(percent = char_sentiment_count / total * 100 ) %>% 
  select(-char_sentiment_count, -total) %>% 
  spread(covid_address_effect, percent)  %>% 
  radarchart::chartJSRadar(showToolTipLabel = T,
                           main = "Any Effects on the Emotion Class Percentage After Address? - No",
                           maxScale = 25, responsive = T, addDots = T,
                           colMatrix = grDevices::col2rgb(my_colors[c(2, 4)]),
                           lineAlpha = 0.7, polyAlpha = 0.2)

Simple Linear Regression

OLS Linear Regression on average sentiment over time period

merged_dataset <- SSA1 %>% 
  inner_join(sentiment_daily, by = c("Date" = "created_at")) %>% 
  filter(Date >= as.Date('2021-10-10') & Date <= as.Date('2021-10-30')) 

# add a before/after indicator for the PM's address (cutoff: 20 Oct)
merged_dataset <- merged_dataset %>% 
  mutate (covid_address_effect = as.factor(ifelse(Date < '2021-10-20','Before','After')))

merged_dataset %>% 
  lm(average_sentiment ~ covid_address_effect + I(Date - as.Date('2021-10-20')),.) %>% 
  summary()
## 
## Call:
## lm(formula = average_sentiment ~ covid_address_effect + I(Date - 
##     as.Date("2021-10-20")), data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36397 -0.09578  0.01303  0.08805  0.32045 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     -0.340705   0.080718  -4.221 0.000514 ***
## covid_address_effectBefore      -0.238246   0.150119  -1.587 0.129913    
## I(Date - as.Date("2021-10-20")) -0.007743   0.012382  -0.625 0.539602    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1718 on 18 degrees of freedom
## Multiple R-squared:  0.2093, Adjusted R-squared:  0.1214 
## F-statistic: 2.382 on 2 and 18 DF,  p-value: 0.1209
merged_dataset %>% 
  ggplot(aes(x = Date, y = average_sentiment)) +
  geom_point(aes(color = covid_address_effect)) + 
  geom_smooth(method = "lm") +
  scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d", date_minor_breaks = "1 day") +
  ggthemes::theme_fivethirtyeight() +
  ylab("Average Sentiment") +
  theme(axis.title.y = element_text(), legend.position = "bottom") +
  labs(title="OLS Simple Regression Model")

Regression Discontinuity Analysis

We perform Regression Discontinuity Analysis on the effect of the PM’s address.

We expect to observe a high tweet volume on Oct. 20 and a jump in the sentiment score between Oct. 19 and Oct. 21.

# RDD
RDD <- merged_dataset %>% 
  ggplot(aes(x = Date, y = average_sentiment, color = covid_address_effect)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  ggthemes::theme_fivethirtyeight() + 
  ggtitle("Regression Discontinuity Analysis") +
  ylab("Average Sentiment") +
  theme(axis.title.y = element_text()) +
  scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d", date_minor_breaks = "1 day") +
  geom_vline(xintercept = as.Date('2021-10-20'), size = 1,linetype="dotdash", color = my_colors[8])

# Difference in Means Test
RDD_box <- merged_dataset %>% 
  ggplot(aes(x = Date, y = average_sentiment, color = covid_address_effect)) +
  geom_boxplot(outlier.colour="black",
               outlier.size=2, notch=FALSE) + 
  geom_point() +
  ggthemes::theme_fivethirtyeight() + 
  ggtitle("Test for Significant Difference") +
  scale_x_date(breaks = c(as.Date('2021-10-10'), as.Date('2021-10-20'), as.Date('2021-10-30')), date_labels = "%b %d", date_minor_breaks = "1 day") +
  geom_vline(xintercept = as.Date('2021-10-20'), size = 1,linetype="dotdash", color = my_colors[8])

gridExtra::grid.arrange(RDD, RDD_box, ncol=2)

We perform a t-test to check whether the difference between the two groups (before and after the PM’s address) is significant.

# Conduct a difference of means test
# Hypothesis: H0 : mean of pre-address_effect = mean of post-address_effect
merged_dataset %>%
  t.test(average_sentiment ~ covid_address_effect, .)
## 
##  Welch Two Sample t-test
## 
## data:  average_sentiment by covid_address_effect
## t = 2.1578, df = 18.219, p-value = 0.04453
## alternative hypothesis: true difference in means between group After and group Before is not equal to 0
## 95 percent confidence interval:
##  0.004272338 0.309624417
## sample estimates:
##  mean in group After mean in group Before 
##           -0.3794175           -0.5363659
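The Welch statistic reported above corresponds to the usual unequal-variance two-sample formula, plugging in the two group means from the output:

\[ t = \frac{\bar{x}_{After} - \bar{x}_{Before}}{\sqrt{s^2_{After}/n_{After} + s^2_{Before}/n_{Before}}} = \frac{-0.379 - (-0.536)}{0.073} \approx 2.16 \]

with the degrees of freedom (here 18.2) given by the Welch–Satterthwaite approximation. Since p = 0.045 < 0.05, we reject the null hypothesis of equal means.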

Model

In preparation for the model, we computed a correlation matrix to see how our variables of interest (within the model) are related.

pacman::p_load(Hmisc, broom, DT)

SSA1 %>% 
  select(Daily.Confirmed, Daily.Deaths, Still.Hospitalised,Intensive.Care.Unit..ICU.) %>%
  as.matrix(.) %>% 
  rcorr(.) %>% 
  tidy(.) %>% 
  rename(variable_1 = column1,
         variable_2 = column2,
         corr = estimate) %>% 
  mutate(abs_corr = abs(corr)
  ) %>% 
  datatable(options = list(scrollX = T),
  ) %>% 
  formatRound(columns = c("corr", "p.value", "abs_corr"), 
              digits = 3)

We performed mean-centering transformations on all the variables that will be turned into interaction terms.

data <- SSA1 %>% 
  select(Daily.Confirmed, Daily.Deaths, Still.Hospitalised,Intensive.Care.Unit..ICU.) %>% 
  mutate_at(vars(Intensive.Care.Unit..ICU.:Still.Hospitalised),
            ~(. - mean(.,na.rm=T)))
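As a quick sanity check (a sketch with made-up numbers, not our dataset), mean-centering shifts a variable to mean zero without changing its spread:

```r
x <- c(41, 42, 46, 62, 75)               # hypothetical ICU counts
x_centered <- x - mean(x, na.rm = TRUE)  # same transformation as in mutate_at above
stopifnot(abs(mean(x_centered)) < 1e-9)        # centered mean is ~0
stopifnot(abs(sd(x_centered) - sd(x)) < 1e-9)  # spread unchanged
```

Centering the predictors makes the main-effect coefficients in the interaction model interpretable at the mean of the moderating variable and reduces nonessential collinearity between the predictors and their products.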

Specify OLS model

We ran two regression models. The first regressed daily confirmed cases, hospitalisations, and ICU occupancy onto daily deaths (model1).

\[ \begin{eqnarray} \widehat{deaths} = intercept + b_1confirmed + b_2hospitalised + b_3icu + \epsilon \end{eqnarray} \]

Our key investigation lies in the next model, in which we added interaction terms between ICU occupancy and the other two predictors (model2).

\[ \begin{eqnarray} \widehat{deaths} = intercept + b_1confirmed + b_2hospitalised + b_3icu + \\ b_4confirmed \times icu + b_5hospitalised \times icu + \epsilon \end{eqnarray} \]

model1 <- lm(Daily.Deaths ~ Daily.Confirmed + Still.Hospitalised + Intensive.Care.Unit..ICU. , 
             data)

tidy(model1) %>% as_tibble()
term                        estimate   std.error  statistic  p.value
(Intercept)                 10.8       3.19        3.38      0.00202
Daily.Confirmed              0.000382  0.000978    0.391     0.699
Still.Hospitalised          -0.000554  0.00909    -0.061     0.952
Intensive.Care.Unit..ICU.    0.0588    0.0701      0.839     0.408
glance(model1)
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC  BIC  deviance  df.residual  nobs
0.0332     -0.0635        3.55   0.343      0.794    3   -89.2   188  196  378       30           34
model2 <- lm(Daily.Deaths ~ (Daily.Confirmed + Still.Hospitalised) * Intensive.Care.Unit..ICU. , 
             data)

tidy(model2) %>% as_tibble()
term                                           estimate   std.error  statistic  p.value
(Intercept)                                    10.9       3.5         3.11      0.00429
Daily.Confirmed                                 0.000382  0.0011      0.346     0.732
Still.Hospitalised                             -0.00169   0.0103     -0.165     0.87
Intensive.Care.Unit..ICU.                      -0.0179    0.46       -0.0389    0.969
Daily.Confirmed:Intensive.Care.Unit..ICU.       2.39e-05  0.000157    0.152     0.88
Still.Hospitalised:Intensive.Care.Unit..ICU.   -0.000269  0.00115    -0.234     0.817
glance(model2)
r.squared  adj.r.squared  sigma  statistic  p.value  df  logLik  AIC  BIC  deviance  df.residual  nobs
0.0361     -0.136         3.67   0.21       0.956    5   -89.1   192  203  377       28           34

We tested if model2, with interaction terms, enhances the explanatory power of the model using anova function.

anova(model1, model2)
Res.Df  RSS  Df  Sum of Sq  F       Pr(>F)
30      378
28      377  2   1.15       0.0427  0.958
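The F-statistic in the anova output is the incremental F computed from the residual sums of squares and residual degrees of freedom of the two nested models:

\[ F = \frac{(RSS_1 - RSS_2)/(df_1 - df_2)}{RSS_2/df_2} = \frac{1.15/2}{377/28} \approx 0.043 \]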

The results of the analysis suggest that adding the interaction terms does not significantly improve the explanatory power of model2 over model1 (F = 0.0427, p = 0.958).

Assumption Check

Prof. Roh’s Note: “Here, please check the linearity assumption, using Global Validation of Linear Model Assumption (gvlma) package. You may visualize the core information of assumption checks, using ggfortify package.”

library(gvlma)
gvlma(model2)
## 
## Call:
## lm(formula = Daily.Deaths ~ (Daily.Confirmed + Still.Hospitalised) * 
##     Intensive.Care.Unit..ICU., data = data)
## 
## Coefficients:
##                                  (Intercept)  
##                                    1.087e+01  
##                              Daily.Confirmed  
##                                    3.821e-04  
##                           Still.Hospitalised  
##                                   -1.692e-03  
##                    Intensive.Care.Unit..ICU.  
##                                   -1.792e-02  
##    Daily.Confirmed:Intensive.Care.Unit..ICU.  
##                                    2.393e-05  
## Still.Hospitalised:Intensive.Care.Unit..ICU.  
##                                   -2.693e-04  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = model2) 
## 
##                     Value p-value                Decision
## Global Stat        2.4730  0.6495 Assumptions acceptable.
## Skewness           0.1038  0.7473 Assumptions acceptable.
## Kurtosis           1.7981  0.1799 Assumptions acceptable.
## Link Function      0.4116  0.5212 Assumptions acceptable.
## Heteroscedasticity 0.1595  0.6896 Assumptions acceptable.
library(ggthemes)
theme_set(theme_fivethirtyeight())

library(ggfortify)
autoplot(gvlma(model2))

Check Multicollinearity

library(car)
vif(model1);vif(model2)
##           Daily.Confirmed        Still.Hospitalised Intensive.Care.Unit..ICU. 
##                  1.025765                  1.226518                  1.211086
##                              Daily.Confirmed 
##                                     1.222041 
##                           Still.Hospitalised 
##                                     1.463277 
##                    Intensive.Care.Unit..ICU. 
##                                    48.865490 
##    Daily.Confirmed:Intensive.Care.Unit..ICU. 
##                                    52.250322 
## Still.Hospitalised:Intensive.Care.Unit..ICU. 
##                                     1.460227
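The very large VIFs on the interaction terms of model2 (about 49 and 52) are largely structural: a product term is necessarily correlated with its component variables. A common remedy, sketched below under the assumption that the predictors live in `data` exactly as named in the model call, is to mean-center the predictors before forming the interaction:

```r
# Mean-center the predictors, then refit the interaction model;
# centering typically removes most of the structural collinearity
# between main effects and their product terms.
data_c <- data %>%
  mutate(across(c(Daily.Confirmed, Still.Hospitalised,
                  Intensive.Care.Unit..ICU.),
                ~ .x - mean(.x, na.rm = TRUE)))

model2_c <- lm(Daily.Deaths ~ (Daily.Confirmed + Still.Hospitalised) *
                 Intensive.Care.Unit..ICU.,
               data = data_c)

car::vif(model2_c)  # interaction-term VIFs should drop markedly
```

Centering changes the interpretation of the main effects (they become effects at the mean of the other predictors) but leaves the interaction coefficients and overall fit unchanged.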

Report the Results with kable in R Markdown

Prof. Roh’s Note: “Now that the assumption check is done, you might want to put the results into a prettier format of table. The default print-out of a table in R Markdown does not look good. The knitr package contains a very basic command, kable, which will format an array or data frame into a more presentable display. Thus, use the following for your report.”

library(knitr) # Please install the package "knitr" first.
library(kableExtra) # You might want to use package "kableExtra" as well.
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:huxtable':
## 
##     add_footnote
## The following object is masked from 'package:dplyr':
## 
##     group_rows
kable(tidy(model2))%>%
  kable_styling("striped", full_width = T, fixed_thead = T) %>%
  column_spec(c(1, 5), bold = T) %>%
  row_spec(c(2, 4, 6), bold = T, color = "white", background = "#ff6347")
term estimate std.error statistic p.value
(Intercept) 10.8686811 3.4965288 3.1084203 0.0042883
Daily.Confirmed 0.0003821 0.0011032 0.3463539 0.7316660
Still.Hospitalised -0.0016917 0.0102596 -0.1648897 0.8702155
Intensive.Care.Unit..ICU. -0.0179243 0.4601972 -0.0389492 0.9692072
Daily.Confirmed:Intensive.Care.Unit..ICU. 0.0000239 0.0001573 0.1521081 0.8801924
Still.Hospitalised:Intensive.Care.Unit..ICU. -0.0002693 0.0011515 -0.2338565 0.8167978
kable(glance(model2))%>%
  kable_styling("striped", full_width = T, font_size = 12) %>%
  column_spec(c(2, 4, 6), bold = T, color = "white", background = "#ff6347")
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
0.0360927 -0.1360335 3.668687 0.2096876 0.9555829 5 -89.13761 192.2752 202.9597 376.8594 28 34

The regression analysis came up with two significant interaction terms.

  • First, it appears that the relationship between social comparison orientation regarding abilities and life satisfaction differs depending on one’s gender.

  • Second, it appears that the relationship between social comparison orientation regarding abilities and life satisfaction differs depending on one’s education.

To see the patterns of interaction, we visualize the significant interaction effects in the next chapter.

Visualize

Part 1: Social Comparison X Gender

To visualize the OLS regression analysis performed above, we stored the OLS regression model’s predictions.

# grid <- data2 %>% 
#   data_grid(sca, fem, ageyear = 0, educ = 0, income = 0, scb = 0) %>%
#   add_predictions(model2)

We undid the centering of the variable (sca).

# grid <- grid %>% 
#   mutate(sca = sca + mean(data$sca))

The following figure shows two lines, one for each gender, illustrating how the relationship between social comparison orientation and life satisfaction differs by gender.

# ggplot(grid, aes(x = sca, y = pred, color = factor(fem))) +
#   geom_line(size = 2)+
#   scale_color_discrete(breaks = c(0, 1), label=c("Male", "Female")) +
#   labs(x="Social Comparison Orientation (Abilities)", 
#        y="Life Satisfaction",
#        color="Gender")+
#   coord_cartesian(ylim=c(1.5, 3.5), xlim=c(0.7, 5.3))+
#   theme_linedraw() +
#   theme(legend.position = "top")

We also plotted the two genders separately, along with data points represented with dots, so that we could see the patterns of the relationship from a different angle.

# ggplot(data, aes(x = sca, y = swl))+
#   geom_point(aes(color = factor(fem)), alpha = 0.3)+
#   geom_line(data = grid, aes(y = pred, color = factor(fem)), size = 1)+
#   labs(x="Social Comparison Orientation (Abilities)", 
#        y="Life Satisfaction",
#        color="Gender")+
#   coord_cartesian(ylim=c(0.7, 5.3), xlim=c(0.7, 5.3))+
#   facet_wrap(. ~ factor(fem),
#              labeller=as_labeller(c("0"="Male","1"="Female")))+
#   theme_linedraw() +
#   theme(legend.position="none")

Part 2: Comparison X Education

Unlike gender, which is a categorical variable, education is a continuous variable in our model. To ease interpretation, we categorized it into three levels (1 SD below the mean, the mean, and 1 SD above the mean). We set gender at 0.5 instead of 0, assuming an equal ratio of males and females.

# data2 %>% summarise(sd(educ))
# 
# grid_group3 <- data2 %>% 
#   data_grid(sca, educ = c(-1.26, 0.00, 1.26),
#             fem = 0.5, ageyear = 0, income = 0, scb = 0) %>% 
#   add_predictions(model2)

We undid the centering of the variable (sca).

# grid <- grid_group3 %>% 
#   mutate(sca = sca + mean(data$sca), educ = factor(as.double(factor(educ))))

The following figure shows three lines representing education levels set at M−1SD, the mean, and M+1SD, as noted above, and how differing education levels change the relationship between social comparison orientation and life satisfaction.

# ggplot(grid, aes(x = sca, y = pred, color = factor(educ))) +
#   geom_line(size = 2) +
#   scale_color_discrete(breaks = c(1, 2, 3), 
#                        label=c("Low in Education", 
#                                "Mean Education", 
#                                "High in Education")) +
#   labs(x = "Social Comparison Orientation (Abilities)", 
#        y = "Life Satisfaction",
#        color = "Education") +
#   coord_cartesian(ylim = c(2.0, 3.5), xlim = c(0.7, 5.3)) +
#   theme_linedraw() +
#   theme(legend.position= "top")

DO THIS WAY :)

Prof. Roh’s Note: “Yes, you might want to perform the analysis above, using jtools, huxtable, ggstance, and interactions packages, as you have learned in class.”

# pacman::p_load(jtools, huxtable, ggstance, interactions)
# 
# m1 <- lm(swl ~ fem + ageyear + educ + income + sca + scb, 
#          data = data)
# 
# m2 <- lm(swl ~ (fem + ageyear + educ + income) * sca + scb, 
#          data = data) 
# 
# export_summs(m1, m2, 
#              error_format = "(t = {statistic}, p = {p.value})",
#              align = "right",
#              model.names = c("Main Effects Only", "with Interactions"),
#              digits = 3)
# 
# plot_summs(m1, m2, 
#            scale = T,
#            plot.distributions = T,
#            model.names = c("Main Effects Only", "with Interactions")) +
#   theme(legend.position = "top")
# 
# sim_slopes(m2,
#            pred = sca, 
#            modx = fem,
#            johnson_neyman = F)
# 
# sim_slopes(m2,
#            pred = fem, 
#            modx = sca,
#            johnson_neyman = T)
# 
# set.seed(20210427)
# 
# interact_plot(m2, 
#               pred = "sca",
#               modx = "fem", 
#               modx.labels = c("Male",
#                               "Female"),
#               interval = T, 
#               int.width = 0.95,
#               colors = c("tomato3", 
#                          "deepskyblue4"),
#               vary.lty = T,
#               line.thickness = 1,
#               legend.main = "Gender",
#               plot.points = T,
#               jitter = 0.1) +
#   geom_vline(xintercept = 3.19, col = "red", linetype = 1, size = 1) +
#   annotate("text",
#            x = 2.1,
#            y = 4.7,
#            label = paste0("The shaded areas denote the boundary\nbetween regions ",
#                           "of significance and\nnon-significance based on alpha at 5%")
#            ) +
#   annotate("rect",
#            fill = "yellow",
#            alpha = 0.1,
#            xmin = 3.19,
#            xmax = 5,
#            ymin = 1,
#            ymax = 5) + 
#   labs(title = "The Interplay of Gender and Social Comparison\non Life Satisfaction",
#        subtitle = paste0("Among the female, the more they compare their abilities ",
#                          "with others,\nthere seems to be lesser life satisfaction."),
#        caption = "Source: Your Dataset",
#        x = "Social Comparison of Abilities (1-5)",
#        y = "Life Satisfaction (1-5)") + 
#   ggthemes::theme_fivethirtyeight() + 
#   theme(legend.position = "top",
#         text = element_text(family = "Courier"),
#         axis.text.y = element_text())

Interpretation of the Results

From our data science project, we arrived at the following findings:

  1. The first case of COVID-19 in Singapore was confirmed on 23 January 2020, and since the nationwide partial lockdown, or “circuit breaker”, ran from 7 April 2020 to 1 June 2020, Singapore has experienced policy changes from time to time to cope effectively with surges. As the new norm has taken root in the local lifestyle, no apparent jump or trend was observed in public sentiment. When we further analysed the effects of the most recent key announcements, a sudden increase in the number of tweets and a jump in overall sentiment suggested a causal effect, but it did not drive any specific emotion class into a trend. Although the announced extension of social restrictions was expected to cause negative sentiment, the PM’s address outweighed that effect, and sentiment emerged with more trust in the leadership. Thus, delivering information, creating awareness, and generating leads appear effective in producing positive public sentiment.

Implications

Prof. Roh’s Note: “This is where you provide the significance of the findings. Unlike the other sections, where your goal is to describe the results that you found (what the data told you). This is where you chime in and proactively discuss the meaning of the results.”

Limitations and Future Directions

Public sentiment reflected in and analysed from tweet data is especially representative of younger age groups, who spend much of their time on social media and readily tweet what is on their minds. It therefore covers only a small sample of the whole local population. If possible, we should also filter out tweets from news publishers and focus on individual tweets. Here, we compared sentiment change within this group based on causal inference driven by nationwide speeches and announcements. To understand how COVID-19 has changed sentiment in society, we would need to identify the timeline of key events and take a longer-term view spanning the pre-COVID and COVID eras; a longer period of data would thus provide more insight for analysing the differences. We were held back by the free Twitter developer account, which is only eligible to extract tweets up to 7 days in the past.

Local slang words, or Singlish, are casually and frequently used in the local community, and many of them could be a good gauge of emotion; for example, even a dialect vulgarity should reflect anger. Enriching an emotion dictionary such as NRC could thus better tune the analysis to distinguish positive and negative sentiment.
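As a sketch of this idea (the Singlish terms and their emotion labels below are illustrative examples, not a validated lexicon), custom rows can be appended to the NRC dictionary before joining it with the tokenized tweets:

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical Singlish entries; a real list should be curated and validated.
singlish_lexicon <- tribble(
  ~word,    ~sentiment,
  "sian",   "negative",   # weary, fed up
  "shiok",  "positive",   # great, satisfying
  "jialat", "negative"    # in serious trouble
)

nrc_extended <- get_sentiments("nrc") %>%
  bind_rows(singlish_lexicon)

# Then join with the token table as usual, e.g.:
# tidy_tweets %>% inner_join(nrc_extended, by = "word") %>% count(sentiment)
```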

Feature importance describes which features are relevant. It is another aspect we should explore to better understand the problem being solved, and it can sometimes lead to improvements in model accuracy through feature selection.
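One possible sketch of this direction, assuming a modeling data frame `model_data` with a numeric outcome `sentiment_score` (both names are placeholders, not part of our pipeline), uses a random forest’s built-in importance measures:

```r
library(randomForest)

set.seed(20211101)  # arbitrary seed for reproducibility

rf_fit <- randomForest(sentiment_score ~ ., data = model_data,
                       importance = TRUE)

importance(rf_fit)   # %IncMSE and IncNodePurity per feature
varImpPlot(rf_fit)   # visual ranking of the most relevant features
```

Features ranked near the bottom would be candidates for removal before refitting.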

References

[1] Julia Silge & David Robinson, Text Mining with R: A Tidy Approach, O’Reilly (2017)

[2] Andrea Cirillo, R Data Mining: Implement data mining techniques through practical use cases and real-world datasets, Packt (2017)

[3] Tony Carilli, R Companion to Real Econometrics (February 2021), https://bookdown.org/carillitony/bailey/chp11.html

[4] Ashwin Malshe, Data Analytics Applications (2019-06-25), https://ashgreat.github.io/analyticsAppBook/collect-tweets.html

Appendix

Custom Stop-words for Text Pre-processing in the Word Cloud Data Overview

“singapore”, “covid”, “covid19”, “positive”, “negative”, “oct”, “nov”, “news”, “amp”, “reuters”, “daily”, “malaysia”, “november”, “october”, “press”, “journal”, “amid”, “weekly”, “days”, “weeks”, “china”, “chinese”, “report”, “am”, “pm”, “dont”, “taking”, “found”, “morning”, “bloomberg”, “months”, “month”, “india”, “media”, “week”, “read”, “reports”, “data”, “europe”, “monday”, “tuesday”, “wednesday”, “thursday”, “friday”, “satursday”, “sunday”, “wall”, “street”

The objective is to clean out tokens that are less relevant and carry little meaning for sentiment, such as punctuation, special characters, tokens prefixed with numbers, hashtags, aliases, links, and the custom terms above.
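A minimal sketch of this cleaning step with stringr (assuming the raw tweets sit in a data frame `tweets` with a `text` column, as returned by rtweet) could look like:

```r
library(dplyr)
library(stringr)

tweets_clean <- tweets %>%
  mutate(text = text %>%
           str_remove_all("https?://\\S+") %>%  # links
           str_remove_all("@\\w+") %>%          # aliases / mentions
           str_remove_all("#\\w+") %>%          # hashtags
           str_remove_all("[^A-Za-z' ]") %>%    # punctuation, digits, symbols
           str_squish())                        # collapse extra whitespace

# The custom terms are then dropped after tokenization, e.g. with
# unnest_tokens() followed by anti_join(custom_stopwords, by = "word").
```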

We removed duplicated tweet text sent from the same screen name multiple times. For instance, several news publishers posted the same tweet on different days.
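With rtweet’s standard columns, one way to sketch this de-duplication (again assuming the tweets data frame is called `tweets`) is dplyr’s distinct(), keeping the first occurrence of each screen name/text pair:

```r
library(dplyr)

# Keep only the first occurrence of each (screen_name, text) pair, so a
# publisher re-posting the same tweet on different days is counted once.
tweets_dedup <- tweets %>%
  distinct(screen_name, text, .keep_all = TRUE)
```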

sessionInfo()

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.3.4   knitr_1.33         ggfortify_0.4.12   DT_0.19           
##  [5] glue_1.4.2         rtweet_0.7.0.9000  gtrendsR_1.4.8     gvlma_1.0.0.3     
##  [9] ggstance_0.3.5     scales_1.1.1       ggthemes_4.2.4     sandwich_3.0-1    
## [13] car_3.0-11         carData_3.0-4      Hmisc_4.5-0        Formula_1.2-4     
## [17] survival_3.2-13    lattice_0.20-44    psych_2.1.9        gridExtra_2.3     
## [21] huxtable_5.4.0     jtools_2.1.4       textdata_0.4.1     reshape2_1.4.4    
## [25] wordcloud_2.6      RColorBrewer_1.1-2 wordcloud2_0.2.1   tidytext_0.3.1    
## [29] lubridate_1.7.10   modelr_0.1.8       broom_0.7.9        forcats_0.5.1     
## [33] stringr_1.4.0      dplyr_1.0.7        purrr_0.3.4        readr_2.0.1       
## [37] tidyr_1.1.4        tibble_3.1.5       ggplot2_3.3.5      tidyverse_1.3.1   
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_2.0-2    ellipsis_0.3.2      rio_0.5.27         
##  [4] htmlTable_2.2.1     base64enc_0.1-3     fs_1.5.0           
##  [7] rstudioapi_0.13     farver_2.1.0        SnowballC_0.7.0    
## [10] fansi_0.5.0         xml2_1.3.2          splines_4.1.0      
## [13] mnormt_2.0.2        jsonlite_1.7.2      cluster_2.1.2      
## [16] dbplyr_2.1.1        png_0.1-7           compiler_4.1.0     
## [19] httr_1.4.2          backports_1.2.1     assertthat_0.2.1   
## [22] Matrix_1.3-4        fastmap_1.1.0       cli_3.1.0          
## [25] htmltools_0.5.2     tools_4.1.0         gtable_0.3.0       
## [28] rappdirs_0.3.3      Rcpp_1.0.7          cellranger_1.1.0   
## [31] jquerylib_0.1.4     vctrs_0.3.8         svglite_2.0.0      
## [34] nlme_3.1-152        crosstalk_1.1.1     xfun_0.25          
## [37] openxlsx_4.2.4      rvest_1.0.1         lifecycle_1.0.1    
## [40] pacman_0.5.1        zoo_1.8-9           hms_1.1.1          
## [43] parallel_4.1.0      yaml_2.2.1          curl_4.3.2         
## [46] pander_0.6.4        sass_0.4.0          rpart_4.1-15       
## [49] latticeExtra_0.6-29 stringi_1.7.4       highr_0.9          
## [52] tokenizers_0.2.1    checkmate_2.0.0     zip_2.2.0          
## [55] systemfonts_1.0.2   commonmark_1.7      rlang_0.4.12       
## [58] pkgconfig_2.0.3     evaluate_0.14       labeling_0.4.2     
## [61] htmlwidgets_1.5.4   tidyselect_1.1.1    plyr_1.8.6         
## [64] magrittr_2.0.1      R6_2.5.1            generics_0.1.1     
## [67] DBI_1.1.1           mgcv_1.8-36         pillar_1.6.4       
## [70] haven_2.4.3         foreign_0.8-81      withr_2.4.2        
## [73] abind_1.4-5         nnet_7.3-16         janeaustenr_0.1.5  
## [76] crayon_1.4.2        utf8_1.2.2          tmvnsim_1.0-2      
## [79] tzdb_0.1.2          rmarkdown_2.10      jpeg_0.1-9         
## [82] radarchart_0.3.1    grid_4.1.0          readxl_1.3.1       
## [85] data.table_1.14.2   webshot_0.5.2       reprex_2.0.1       
## [88] digest_0.6.28       munsell_0.5.0       viridisLite_0.4.0  
## [91] bslib_0.2.5.1

Contribution Statement

Prof. Roh’s Note: “Please describe your individual contribution to the team’s project (in detail).”